home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
ftp.mactech.com 2010
/
ftp.mactech.com.tar
/
ftp.mactech.com
/
challenge
/
13.10
/
ChallengeDisambig.sit.hqx
/
Challenge, Disambiguator
/
disambig.cp
Wrap
Text File
|
1997-08-22
|
32KB
|
1,117 lines
/*
July 1, 1997.
Submission to MacTech Programmer's challenge for July 97.
Copyright 1997, Ernst Munter, Kanata, ON, Canada.
"DISAMBIGUATOR"
Version 2: minor tweak to gain 2 to 3% in speed
Version 3: major tweak: gain 20% in speed
added length-byte to page data,
avoids unneeded calls to strlen()
Problem Statement
-----------------
Given a list of words and a findString, find all matching
words which start with findString. FindString may contain
wild cards ? * and +.
An initialization routine can be used to build a private
data structure to speed up searching.
The initial word list may or may not be sorted, but each
generated match list should be sorted.
Solution Strategy
-----------------
A string matching engine is needed to evaluate words from
the word list against the find string.
In the simplest case, the match list can be created by
comparing each word in the word list with the find string,
and then sorting match list. This is the procedure I
use when there is insufficient storage available (which
may happen if the word list is very short).
When there is enough storage available, a private index is
derived from the word list. This is based on word length
as well as bit maps representing 64-bit "digests" of words.
On searching with a particular find string, only words of
at least the known minimum length need be considered.
The find string is also represented with a digest.
The digest does not describe a word uniquely, and a final
direct comparison is needed to confirm a word for the
match list. But the digest representation helps in
filtering out many words and groups of words quickly that
need not be compared.
Alphabetical sorting is done on each generated match list.
The String Compare Engine
-------------------------
There are two possibilities:
- findString contains wild cards
- or it does not
If it does, string matching will be performed by a
virtual machine executing a very simple byte code program;
however, if there are no wild cards, an even simpler
comparison routine is used.
In any case, the Parser function processes the input
string (findString) into a program and an output string.
The output string is simply the input string with wild
cards removed.
The first value in the program is the minimum length of
any string required to match the original findString.
The subsequent program bytes are from the set of
instructions
- READ n read n chars without comparing
- COMPF n compare n chars,
on mismatch exit with FAIL
- COMPR n compare n chars,
on mismatch go back (n-1) and retry
The program always ends with SUCCEED
Special single byte forms of READ1 and COMPn (n<=7)
optimize for common cases (I hope), and also ensure that
the length of the program is guaranteed not to exceed the
length of findString. Since strlen(findString) is known
to be < 256, a fixed amount of memory can be provided for
the program on the CPU stack.
Private Data Organization
-------------------------
All words of the original word list are represented
grouped in pages of 32 (max) words of equal length.
Words of length > 31 are not differentiated and are
lumped in with pages of 31-char words.
Each page contains pointers to 32 words, their
individual word digests, and a summarizing page digest.
Word Digest
-----------
Two 32-bit words (28 bits used), one for single, the
other for multiple character occurrences in a word.
Digest-1 of a wordList word is a 32-bit computer word,
each bit indicating the presence or not of a character
value in the wordList word.
Bits in digest-2 are set only if 2 or more of a
particular character occur in the word.
The hash function to convert a character into a bit
position is tabulated in charTable[128] which maps all
legal characters into the range 1 to 28. The values in
charTable are roughly in order of frequency of occurrence
of letters in an English dictionary.
The digest for a find string is based on the non-wild
cards in the string, and thus will be a subset of any
match word matching the find string.
Pages
-----
The example below shows an excerpt from a typical page
based on the dictionary I used for a word list.
Each row represents the word digest for the referenced
word at the end of the row.
The last row is the page digest, the bit-OR of the columns.
Note that for reasons explained below, storage of the word
digests is by column, that is the bits shown below are stored
in 56 ulongs, each column representing the presence single or
multiple, of a particular character in all words of the page.
- single - - multiple -
_ qjxzkwvfybhmgpudclotransie= QJXZKWVFYBHMGPUDCLOTRANSIE
00000000000000001000010111100000000000000000000000001010 Tunisian
00000000000000001000010111100000000000000000000000000100 sustains
00000000000011001000010111100000000000000000000000000000 humanist
00000000000000011000010111100000000000000000000001000000 pantsuit
00000000000000011000010111100000000000000000000000000100 puissant
10000000000000000100010111100000000000000000000000001000 stand_in
00000000000000100100010111100000000000000000000000001000 standing
00000000010000000010010111100000000000000000000000010000 fanatics
00000000001000000010010111100000000000000000000001000000 sanctity
00000000011000000010010111100000000000000000000000000000 sanctify
00000000000000100010010111100000000000000000000000000100 castings
00000000000000100010010111100000000000000000000001000000 scatting
00000010000000100010010111100000000000000000000000000000 stacking
00000000000010100010010111100000000000000000000000000000 scathing
00000000000000010010010111100000000000000000000000010000 captains
00000000000000000110010111100000000000000000000000010000 antacids
00000000000000000001010111100000000000000000000000000010 initials
00000000000000000001010111100000000000000000000100000100 installs
00000000000000000001010111100000000000000000000000010010 Italians
00000000010000000001010111100000000000000000000000000010 finalist
00000000001000000001010111100000000000000000000000000010 salinity
00000000000000100001010111100000000000000000000000001000 slanting
00000000000000100001010111100000000000000000000100000000 stalling
00000000000000100001010111100000000000000000000000000010 tailings
00000010000000100001010111100000000000000000000000000000 stalking
00000000000100100001010111100000000000000000000000000000 blasting
00000000000100100001010111100000000000000000000000000000 stabling
00000000000000010001010111100000000000000000000000000010 tailspin
00000000000000110001010111100000000000000000000000000000 stapling
00000000000100001001010111100000000000000000000000000000 Istanbul
00000000000000101001010111100000000000000000000000000000 saluting
00000000000000011001010111100000000000000000000000000000 nuptials
page digest:
10000010011111111111010111100000000000000000000101011110
Before assembling the words into pages, a private
temporary word list is built, sorted on digest1. This
groups words of similar digests into a page, resulting in
sparser, more effective bit patterns for the page digests.
Word Filtering
--------------
To find matching words we use the page index to skip past
all pages containing words of less than the minimum length.
The page index scan then provides a fast screening function
by comparing the find digest with each page digest (a copy
of which exists in the index entry).
Once a promising page is identified on the basis of its
digest, it is scanned to find matching words.
The findString generates a signature, a string of unique
hash values of the original findString. To scan the digest
bits in a page for a particular findString, the signature
values are used to select columns of bits. These are
accumulated (bitwise AND). The resulting 32-bit value
identifies, by 1-bit positions, the words matching the
findString as far as digests go. The accumulation of
columns is abandoned as soon as the accumulator reads 0,
indicating that no words on the page will match.
If the accumulator pattern is not 0, one or several words
may be indicated, but they still need to be confirmed with
the alpha-numerical string matcher before being added to
the match list.
Sorting
-------
Sorting of the final match list is done using a form of
HeapSort, split into its two components, building of the
priority queue (my routine Send()) and down sifting (my
routine Sort()).
Special Cases
-------------
If findString contains ONLY wild cards, there is no need
to match any words except for length. Since word pages
are already organized by word length, it remains to send
all words from all pages of the minimum length given by
findString (the number of '+' and '?' symbols).
Optimizations
-------------
When there are no wild cards in findString, a simpler and
faster alphanumeric comparison is used.
The pageIndex entries contain a copy of the page digest1.
This avoids memory access to many pages which evidently
contain no matching words.
The hash function is chosen to represent character
frequency in order to cluster words in pages more
optimally. This feature relies on presorting the private
word list accordingly. The program will work correctly if
this sort is turned off, resulting in faster initialization,
but slower matches.
In my tests, presorting paid off.
There is some amount of code replication for expediency,
notably there are two customized copies of heap sort, and
two versions of generating word digests. These could
probably avoided without a great loss of speed.
A final optimization is the way word signatures are constructed.
These are derived from findString and used to scan the columns
in the page bit maps. Rather than convert the findString
alphanum characters to signature characters (range 1-56) in
the order of occurrence in findString, we arrange it so the
rarest characters are tested first.
Assumptions
-----------
All words, and findString are < 256 characters long.
There are no zero-length words in wordList.
The absolute minimum amount of private storage is 4 bytes.
This ensures that non-initialization of the tables can be
determined. In this mode, a sequential scan of wordList
is made.
For optimal operation, memory available for private data
should be at least 256 bytes + about 13 bytes per word.
The exact amount of storage needed depends on the length
distribution of the words in wordList.
FindString will be modified by Disambiguator.
Static memory of 384 bytes is used for lookup tables.
*/
/**** the published header file ****/
typedef unsigned long ulong;
void InitDisambiguator(
const char *const wordList[], /* words to match against */
ulong numWords, /* number of words in wordList */
void *privStorage, /* private storage preinitialized to zero */
ulong storageSize /* number of bytes of privStorage */
);
ulong /*numMatch*/ Disambiguator(
const char *const wordList[], /* words to match against */
ulong numWords, /* number of words in wordList */
void *privStorage, /* private storage */
ulong storageSize, /* number of bytes of privStorage */
char *findString, /* string to match, includes wild cards */
const char *matchList[] /* return matched words here */
);
/**** end of published header file ****/
#include <stdlib.h> // It seems unnecessary to mention
#include <string.h> // these with CW-Pro
// NOTE:
// Program Tuning Option
// Turning SORTWORDINDEX off reduces initialization time
// to 1/3 but increases search time by 50 to 80%
// The best setting depends on dictionary, findString
// patterns and the number of searches in a test run.
#define SORTWORDINDEX 1
typedef unsigned char Opcode;
// 'const char' is used a lot.
#define CCC const char
/*** Function prototypes ***/
static int Parse(char* spec,Opcode program[]);
static int SimpleMatch(CCC* x,char* spec,int minLen);
static int VMMatch(CCC* x,char* spec,Opcode* program);
static int Comp(CCC* ap,CCC* bp);
static ulong MakeSubDigest(
CCC* xString,ulong* dig2,char sig[]);
static void Sort(CCC* matchList[],long numMatch);
static void Send(CCC* matchList[],CCC* wp,ulong numMatch);
/*** Static allocations ***/
// charTable serves double duty:
// in parsing, it helps separate wild cards.
// in digest forming, it provides a sort order.
static char charTable[128] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 2, 3, 4, 5, 6, 7, 8, 9,10, 0, 0, 0, 0, 0, 0,
0,25,12,19,18,28,10,16,13,27, 4, 7,20,14,23,21,
15, 3,24,26,22,17, 9, 8, 5,11, 6, 0, 0, 0, 0, 1,
0,25,12,19,18,28,10,16,13,27, 4, 7,20,14,23,21,
15, 3,24,26,22,17, 9, 8, 5,11, 6, 0, 0, 0, 0, 0
};
// The table LSB stores the position of the least
// significant '1' bit in a byte (range 1 to 8),
// a zero-byte reports 0.
// This table is built from nested #defines
#define T1 1
#define T2 T1,2,T1
#define T3 T2,3,T2
#define T4 T3,4,T3
#define T5 T4,5,T4
#define T6 T5,6,T5
#define T7 T6,7,T6
#define T8 T7,8,T7
static char LSB[256]={0,T8};
// MISMATCH takes care of case insensitive matching
#define MISMATCH(a,b) ((a^b)&0xDF)
#define MIN(a,b) (a<b?a:b)
// HASH is just shorthand for charTable.
#define HASH(c) (charTable[c])
// Maximum length of strings
#define MAX_LEN 256
const enum { //compr must be last
SUCCEED=0,FAIL,RETRY,READ,READ1,COMPF,COMPR=COMPF+8};
static int VMMatch(CCC* x,int slen,char* spec,Opcode* program) {
// Virtual machine interpreter.
// VMMatch executes "program" with the help of
// "spec" to determine if string "x" matches.
int margin=slen-*program;
if (margin<0) return FAIL;
CCC* saveX;
char* saveSpec;
Opcode* savePC=0;
Opcode* pgm=program;
int matchLen,i;
x--;spec--;
for (;;) {
Opcode op=*++pgm;
switch (op) {
case SUCCEED:
case FAIL:
return op;
case COMPF+7: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+6: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+5: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+4: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+3: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+2: if (MISMATCH(*++x,*++spec)) goto retry;
case COMPF+1: if (MISMATCH(*++x,*++spec)) goto retry;
break;
case COMPF:
matchLen=*++pgm;
for (i=0;i<matchLen;i++) {
if (MISMATCH(*++x,*++spec)) goto retry;
}
break;
retry:
// this mysterious retry after mismatch takes care of
// "*x?y" to find word "xxay", matching the second x
if ((savePC==0)||(0==margin--)) return op;
pgm=savePC-1;
x=saveX+1;
spec=saveSpec;
break;
case COMPR+7:
case COMPR+6:
case COMPR+5:
case COMPR+4:
case COMPR+3:
case COMPR+2:
case COMPR+1:
savePC=pgm;
matchLen=op-COMPR;
goto try_again;
case COMPR:
savePC=pgm;
matchLen=*++pgm;
try_again:
for (i=1;i<=matchLen;i++) {
if (MISMATCH(x[i],spec[i])) {
if (margin--) {
++x;
goto try_again;
}
return op;
}
}
saveX=x;
saveSpec=spec;
x+=matchLen;
spec+=matchLen;
break;
case READ:
x+=*++pgm;
break;
case READ1:
++x;
break;
default:
return op;
}
}
}
// The class WordIndex holds a pointer to a word from
// wordList and computes digests and signature for it.
struct WordIndex {
ulong digest1;
CCC* word;
void Init(CCC* wordx) {
CCC* wp=wordx;
int s=HASH(*wp);
ulong dig1=1L<<s;
for (;;) {
int c;
if (0==(c=*++wp)) break;
s=HASH(c);
ulong bit=1L<<s;
dig1 |= bit;
}
digest1=dig1;
word=wordx;
}
ulong MultiDigest(char* sig) {
// Returns digest2 for the word,
// and computes its signature
CCC* wp=word;
int s=HASH(word[0]);
ulong digest2=0;
ulong dig1=1L<<s;
sig[0]=s;
for (;;) {
int c;
if (0==(c=*++wp)) break;
s=HASH(c);
ulong bit=1L<<s;
if (0==(dig1 & bit)) {
*++sig=s;
dig1 |= bit;
} else if (0==(digest2 & bit)) {
*++sig=s+28;
digest2 |= bit;
}
}
*++sig=0;
return digest2;
}
int CompDigest(WordIndex wx) {
return (wx.digest1)-(digest1);
}
};
// The class Page holds pointers to 32 words from wordList
// which are of the same length (words >= 31 together).
// Page also contains both word digests for each word,
// in the bits[] array, stored in signature oriented columns.
// A page provides string matching for the 32 words it owns.
struct Page{
CCC* word[32];
char len[32];
int fill;
Page* next;
ulong pageDigest1;
// ulong pageDigest2; overlay on unused bits[0]
#define pageDigest2 bits[0]
ulong bits[57];
void Init(Page* following) {
// clears all and sets linkage.
// memory may already be precleared, but not necessarily
// since pages may overlay the temporary wordIndex
memset(this,0,sizeof(Page));
next=following;
}
int IsFull() {return (fill>=32);}
void Add(WordIndex* wip,int length) {
// Adds one word to a page,
// ORs the word digests into the page digests,
// Also ORs the horizontal bit slice representing word digests
// into the bits array.
char sig[64];
ulong curbit=1L<<fill;
len[fill]=length;
word[fill++]=wip->word;
pageDigest1 |= wip->digest1;
pageDigest2 |= wip->MultiDigest(sig);
int c;
char* sigp=sig;
while (0 !=(c=*sigp++)) {
bits[c] |= curbit;
}
}
ulong Match(char sig[])
// Accumulates vertical bit slices from a given signature
// Returns a bit map of likely candidate words
{
int c=sig[0];
ulong acc=bits[c];
while ((acc) && (0 != (c=*++sig))) {
acc &= bits[c];
}
return acc;
}
ulong SendSelectionVM(
ulong acc,
char* findString,
Opcode program[],
CCC* matchList[],
ulong numMatch)
{
// Uses the "acc" bitmap to identify words for full
// string matching, and sends matching words to the matchList
// Uses the LSB array to quickly isolate bits in acc.
CCC** wp=word-1;
char* lenp=len-1;
do {
ulong accLo=acc & 0xFF;
if (accLo) {
int j=LSB[accLo];
acc>>=j;
wp+=j;lenp+=j;
if (SUCCEED==VMMatch(*wp,*lenp,findString,program))
Send(matchList,*wp,numMatch++);
} else {
acc>>=8;
wp+=8;
}
} while (acc);
return numMatch;
}
ulong SendSelection(
ulong acc,
char* findString,
Opcode program[],
CCC* matchList[],
ulong numMatch)
{
// Same as SendSelectionVM, but uses simpler string matching
CCC** wp=word-1;
int minLen=program[0];
do {
ulong accLo=acc & 0xFF;
if (accLo) {
int j=LSB[accLo];
acc>>=j;
wp+=j;
if (SUCCEED==SimpleMatch(*wp,findString,minLen))
Send(matchList,*wp,numMatch++);
} else {
acc>>=8;
wp+=8;
}
} while (acc);
return numMatch;
}
ulong SendAll(CCC* matchList[],ulong numMatch) {
// Sends all words on page to matchList
for (int i=0;i<fill;i++)
Send(matchList,word[i],numMatch++);
return numMatch;
}
};
// The class PageIndex contains a pointer to a page, and
// keeps a copy of the page digest1.
// During the scan, PageIndex provides a screening function
// to eliminate unnecessary page accesses if digest1 can
// already rule out all words on a page.
struct PageIndex {
ulong digest1;
Page* page;
void Init(Page* thePage) {
digest1=thePage->pageDigest1;
page=thePage;
}
Page* Screen(ulong findDigest1,ulong findDigest2) {
if ((findDigest1 ^ (findDigest1 & digest1)))
return 0;
return page;
}
};
// The class PrivateData is the main structure mapped into the
// private memory space allocated on the heap.
// The pageGroup[] array holds pointers to linked lists of
// pages, according to word length.
// Once all pages are assembled, the page addresses are remapped
// into a linear page index, sorted by ascending word length
// The indexGroup[i] array points to the the first page of each
// group of pages of a given word length i.
struct PrivateData {
// Page* nextPage; overlay on unused pageGroup[0]
#define nextPage pageGroup[0]
Page* pageGroup[32];
// PageIndex* endOfPageIndex; overlay on indexGroup[0]
#define endOfPageIndex indexGroup[0]
PageIndex* indexGroup[32];
int bottom[1];
void Init(
CCC * const wordList[],
ulong numWords,
ulong storageSize)
{
// Must have at least this much storage to build minimal
// page index system
ulong minimumStorage=
sizeof(pageGroup) +
sizeof(indexGroup) +
numWords*sizeof(WordIndex) +
sizeof(PageIndex) +
sizeof(Page);
if (storageSize<minimumStorage) {
nextPage=0;
return;
}
#if SORTWORDINDEX
// Build priority queue of word index items
// This is the insertion step of heap sort
WordIndex* wordIndexList=(WordIndex*)bottom;
int i,j,n;
for (n=0;n<numWords;n++) {
WordIndex wx;
wx.Init(wordList[n]);
WordIndex* base=wordIndexList-1;
WordIndex z;
i=n+1,j=i>>1;
while ((j>0)&&(wx.CompDigest(z=base[j])>0)) {
base[i]=z;
i=j;j=i>>1;
}
base[i]=wx;
}
// Unload priority queue, step 2 of heap sort
nextPage=(Page*)this + storageSize/sizeof(Page) - 1;
WordIndex* wip;
WordIndex* base=wordIndexList-1;
WordIndex x;
ulong numUnsorted=numWords;
wip=base+numUnsorted+1;
if (numUnsorted>1) do {
i=1;j=2;
x=base[numUnsorted--];
*(--wip) = base[1];
if (numUnsorted<=1) {
base[1]=x;
break;
}
while (j<=numUnsorted) {
if ((j<numUnsorted)
&& (base[j].CompDigest(base[j+1])<0))
j++;
if (x.CompDigest(base[j])>=0)
break;
base[i]=base[j];
i=j;j+=j;
}
base[i]=x;
} while(1);
#else
// No sorting, just copy and initialize word index
// in whatever order wordList is in.
// This reduces Initialization time at the expense
// of search efficiency because more pages will
// have to be filtered.
WordIndex* wordIndexList=(WordIndex*)bottom;
int n;
for (n=0;n<numWords;n++)
wordIndexList[n].Init(wordList[n]);
WordIndex* wip;
#endif
// Unload word index and build pages in linked lists
// based on word length
nextPage=(Page*)this + storageSize/sizeof(Page) - 1;
wip=wordIndexList+numWords;
while (wip>wordIndexList) {
wip--;
if (0==InsertWordInPage(wip)) return;
}
// Map linked lists to a linear index of pages by length
if (0==BuildIndex()) {
nextPage=NULL; // not enough storage
return;
}
}
int InsertWordInPage(WordIndex* wip) {
// Inserts one word in a page, opens a new page if
// none exists or if the current page is full.
int len=strlen(wip->word);
if (len==0) return 1; // ignore 0-length words
int cutLen=MIN(31,len);
Page* page=pageGroup[cutLen];
if ((page==0) || (page->IsFull())) {
Page* temp=page;
page=nextPage--;
// test if the bottom of the growing page array collides
// with the top of the shrinking word index array
if (page <= (Page*)wip) {
// not enough storage, we have to bail out
nextPage=NULL;
return 0;
}
page->Init(temp);
pageGroup[cutLen]=page;
}
page->Add(wip,len);
return 1;
}
PageIndex* BuildIndex() {
// Builds the page index, starting at this->bottom,
// overwriting storage previously used by word index.
PageIndex* pi=(PageIndex*)bottom;
PageIndex* piTop=(PageIndex*)nextPage;
for (int len=1;len<32;len++) {
Page* page=pageGroup[len];
indexGroup[len]=pi;
while (page) {
if (pi>=piTop) return 0;
pi++->Init(page);
page=page->next;
}
}
return (endOfPageIndex=pi);
}
// Use 'nextPage' as a flag to indicate if we initialized
// nextPage will be NULL if we ran out of storage during
// the initialization.
void* IsInitialized(){return nextPage;}
ulong SendAll(CCC* matchList[],ulong minLen) {
// Sends all words >= minimum length from all pages
ulong numMatch=0;
for (int len=MIN(31,minLen);len<32;len++) {
Page* page=pageGroup[len];
while (page) {
numMatch=page->SendAll(matchList,numMatch);
page=page->next;
}
}
return numMatch;
}
ulong CollectVM(
char* findString,
CCC* matchList[],
Opcode program[])
{
// CollectVM scans all pages above the minimum length,
// matches using the virtual machine string matcher,
// and sends matched words into matchList
char sig[64];
ulong findDigest2;
ulong findDigest1=MakeSubDigest(findString,&findDigest2,sig);
ulong numMatch=0;
PageIndex* pi=indexGroup[MIN(31,program[0])];
for (;pi<endOfPageIndex;pi++) {
Page* page=pi->Screen(findDigest1,findDigest2);
if (page) {
ulong acc=page->Match(sig);
if (acc) numMatch=page->SendSelectionVM(
acc,findString,program,matchList,numMatch);
}
}
return numMatch;
}
ulong Collect(
char* findString,
CCC* matchList[],
Opcode program[])
{
// Collect is similar to CollectVM but uses the simpler
// string matching function.
char sig[64];
ulong findDigest2;
ulong findDigest1=MakeSubDigest(findString,&findDigest2,sig);
ulong numMatch=0;
PageIndex* pi=indexGroup[MIN(31,program[0])];
for (;pi<endOfPageIndex;pi++) {
Page* page=pi->Screen(findDigest1,findDigest2);
if (page) {
ulong acc=page->Match(sig);
if (acc) numMatch=page->SendSelection(
acc,findString,program,matchList,numMatch);
}
}
return numMatch;
}
};
void InitDisambiguator(
// InitDisambiguator to be called from the application
CCC *const wordList[],
ulong numWords,
void *privStorage,
ulong storageSize
) {
// Just sets up the private data structure
PrivateData* PD=(PrivateData*)privStorage;
PD->Init(wordList,numWords,storageSize);
}
ulong Disambiguator(
// Disambiguator to be called from the application
CCC *const wordList[],
ulong numWords,
void *privStorage,
ulong storageSize,
char *findString,
CCC *matchList[]
) {
PrivateData* PD=(PrivateData*)privStorage;
// Allocates space for the longest possible VM program on
// the stack and parses find string
Opcode program[MAX_LEN];
int useVM=Parse(findString,program);
// findstring is now stripped of wildcards
ulong numMatch;
if (PD->IsInitialized()) {
// should always be true except ..
if ((*findString) || (program[0]>31)) {
// Normal findString with alphanum characters
// or wildCards only, but minLength > 31
if (useVM) {
// wild cards detected, must use VM matching
numMatch=PD->CollectVM(findString,matchList,program);
} else {
// no wild cards, can use faster simple matching function
numMatch=PD->Collect(findString,matchList,program);
}
} else {
// No characters to match, minimum length <= 31
// we can simply send all >= minimum length
numMatch=PD->SendAll(matchList,program[0]);
}
} else {
// if we get here, PD was not initialized, and we have to
// just scan the entire word list for matches
// We don't really expect to get here except with
// extremely short word lists.
numMatch=0;
if ((*findString) || (program[0]>31)) {
if (useVM) {
for (ulong i=0;i<numWords;i++)
if (SUCCEED==
VMMatch(wordList[i],strlen(wordList[i]),
findString,program))
Send(matchList,wordList[i],numMatch++);
} else {
for (ulong i=0;i<numWords;i++)
if (SUCCEED==
SimpleMatch(wordList[i],findString,program[0]))
Send(matchList,wordList[i],numMatch++);
}
} else {
for (ulong i=0;i<numWords;i++)
if (strlen(wordList[i]) >= program[0])
Send(matchList,wordList[i],numMatch++);
}
}
// Final sort (step 2 of heap sort) for match list
Sort(matchList,numMatch);
return numMatch;
}
/******* Static Functions **********/
static int SimpleMatch(
CCC* x,
char* findString,
int minLen) {
// alphanumeric matche of x against findString,
// no wild cards allowed
x--;findString--;
for (int i=0;i<minLen;i++) {
if (MISMATCH(*++x,*++findString)) return FAIL;
}
return SUCCEED;
}
static void Send(
// Inserts word wp in matchList as a priority queue
// in preparation for sorting later
CCC* matchList[],
CCC* wp,
ulong numMatch) {
CCC** base=matchList-1;
CCC* z;
long i=numMatch+1,j=i>>1;
while ((j>0)&&Comp(wp,z=base[j])>0) {
base[i]=z;
i=j;
j=i>>1;
}
base[i]=wp;
}
static int Comp(CCC* ap,CCC* bp) {
// Alphabetic case insensitive string comparator
char a=*ap;
if (a) do {
char b=*bp;
if (MISMATCH(a,b)) {
return (a | 0x20) - (b | 0x20);
}
a=*++ap;
bp++;
} while (a);
return -1;
}
static int Parse(char* findString,Opcode program[]) {
// Scans the findString and creates a bytecode program.
// All wild cards are stripped from findString.
// program[0] contains the minimum length of words to match,
// the rest of program[] contains tokens from the enum set.
// Returns the number of wild cards removed from findString.
#define EMIT(x) {*++pgm=x;}
Opcode* pgm=program;
char* newFindString=findString;
int onFailure=FAIL;
char c=*findString;
int n,minLen=0,usesWild=0;
for (;;) {
if (0==charTable[c])
switch(c) {
case '?':
n=0;
do {n++;} while ('?'==(c=*++findString));
if (n==1) EMIT(READ1) else {
EMIT(READ);
EMIT(n);
}
minLen+=n;
usesWild++;
break;
case '+':
minLen++;
EMIT(READ1);
case '*':
c=*++findString;
onFailure=RETRY;
usesWild++;
break;
default:
EMIT(SUCCEED);
*newFindString=0; //zero terminate the new FindString
program[0]=minLen;
return usesWild;
} else {
n=0;
do {
n++;
*newFindString++=c; // copy non-wilds to new string
} while (charTable[c=*++findString]);
if (FAIL==onFailure) {
if (n<=7) EMIT(COMPF+n) else {
EMIT(COMPF);
EMIT(n);
}
} else {
if (n<=7) EMIT(COMPR+n) else {
EMIT(COMPR);
EMIT(n);
}
}
onFailure=FAIL;
minLen+=n;
}
}
}
static ulong MakeSubDigest(
CCC* xString,
ulong* dig2,
char sig[]) {
// Creates a pair of word digests and a signature from
// findString. No duplicate signature characters.
int s=HASH(xString[0]);
ulong digest1=1L<<s;
ulong multi=0;
for (;;) {
int c;
if (0==(c=*++xString)) break;
s=HASH(c);
ulong bit=1L<<s;
if (0==(digest1 & bit)) {
digest1 |= bit;
} else {
if (0==(multi & bit)) {
multi |= bit;
}
}
}
// Make the signature in bit order, that is with the
// less frequent English letters first, so that page
// scanning will fail as soon as possible if no word
// in the page will match.
ulong bit1=2,bit2=2;
ulong single=digest1 & (~multi);
int s1,s2;
// Test rarest letter combinations first
if (multi) {
for (s2=29;s2<=56;s2++) {
if (multi & bit2) *sig++=s2;
bit2 += bit2;
}
}
for (s1=1;s1<=28;s1++) {
if (single & bit1) *sig++=s1;
bit1 += bit1;
}
*sig++=0;
*dig2=multi;
return digest1;
}
static void Sort(CCC* matchList[],long numMatch) {
// Heap sort step 2, used for final sorting of the
// match list.
CCC** sList=matchList-1;
CCC* x;
int i,j;
CCC** b=sList+numMatch+1;
if (numMatch>1) do {
i=1;j=2;
x=sList[numMatch--];
*(--b) = sList[1];
if (numMatch<=1) {
sList[1]=x;
break;
}
while (j<=numMatch) {
if ((j<numMatch)
&& (Comp(sList[j],sList[j+1])<0))
j++;
if (Comp(x,sList[j])>=0)
break;
sList[i]=sList[j];
i=j;j+=j;
}
sList[i]=x;
} while(1);
}